Foreword

What is a norm? How does it affect the preparation and
scoring of standardized tests?

Adequate replies to these questions imply a sophistication 
that has yet to be achieved by students, parents, and teachers. 
The experts agree that the mystery of "norming" a test can be
dispelled only by a clearer understanding of how norms affect
the interpretation of test scores. That is why the National
Association of Secondary-School Principals was advised to pre- 
pare a plainly written analysis of the importance of norms. 

In the search for someone who could clarify concepts and 
procedures about norms and norming techniques, we asked
Frank Womer of the University of Michigan to tackle the job.
With a minimum of technical fuss, he shows how norms are
arrived at and how they should be reviewed. We hope that his
work will lead to increased clarity not only within the
profession but also among the intelligent, devoted citizens whose
support is the cornerstone of all educational progress.

Ellsworth Tompkins
Executive Secretary
National Association of
Secondary-School Principals

What This Booklet
Is All About



"What a big boy he is!" is a well-worn phrase, used so often
one might almost assume it had a common meaning for every- 
body. Yet we all know its real meaning depends largely on when 
or where it is used. If someone says it as he looks down into a
baby carriage, it probably means the boy is physically big in 
relation to other babies his own age. If it is said by a visitor who 
has not seen the boy for a while, it may only mean that he is 
surprisingly bigger than he was a year ago. If it is said by the 
football coach as he looks over prospective tackles on the college
freshman team, it may mean that the boy is already bigger than
the typical college tackle. In each context the boy is big only in 
comparison with some standard of bigness— some norm— whether 
that norm be babies in general, his own former size, or the size 
of football tackles.

In these pages we shall be talking a great deal about test 
scores and norms. And the very first point we wish to make is
that they, too, take their meaning largely from some kind of 
comparison. A pupil's test score is high or low only in relation 
to other pupils' scores, to his own previous scores, to a selection 
score indicating possible success in a certain college, or to some 
other criterion or standard for judging what is high or low. 



This discussion deals with the process of determining test 
norms and of using them realistically. It deals, then, with the 
process of putting meaning into a single test score by relating
it to other test scores, achieved by other pupils or by the same
pupil at other times. The focus of attention is on the meaning
of test norms; it is not on the meaning of types of test scores.
For example, how norm tables that yield IQ's are developed will
be considered, but not the concept of the IQ itself. Some
attention will be given to the actual development of percentile
norms, but the various characteristics and facets of percentiles
will not be discussed.

Another delimiting feature of this brochure is that the
discussion is centered around norms for widely used, nationally
standardized tests, as distinct from teacher-made classroom tests.
Some of the principles discussed apply to both, but the major
purpose of this publication is clarification of national test norms.



Some Preliminary Thoughts
on the Purposes of Norms

To have an example by which to illustrate various aspects of 
test norms, let us run through the process of developing a 
standardized test. Call it the "XYZ Arithmetic Test," for grades
7, 8, and 9. Assume that it consists of 25 multiple-choice items,
that these were written to sample the arithmetic knowledges and 
skills one can reasonably expect junior high pupils to have 
developed, and that a good job of item writing was done. 

Let's say that this test was given to John Smith, an 8th grader,
and that John answered 16 of the items correctly. From this
information alone all we can say is that John got 16 items correct
out of 25 possible. If we want to know anything about how
John's performance compares with that of other eighth graders,
we need information about the scores of other eighth graders.
If we want to know how John's performance compared with his
performance on a similar arithmetic test he took in grade 6, we
need information as to his relative performance on that sixth
grade test. If we want to know whether John's arithmetic test
score is high enough to suggest that he might be successful in
algebra next year, we need to know the scores of other pupils
who later elected algebra and succeeded in it.



To make significant judgments about John's score on the 
XYZ Arithmetic Test we must know the scores of other pupils 
on the same test. We need yardsticks based on the performance 
of many pupils so that we can place John's score on those
yardsticks. In our example, we need three yardsticks: 1) the
performance of other eighth graders on the XYZ Arithmetic Test;
2) the performance of other sixth graders on the same
arithmetic test that John took in grade 6; and 3) the performance on
the XYZ Test of eighth graders who subsequently elected algebra
and were successful in it. The entire process of developing test
norms is designed to provide yardsticks, so that the test scores
of an individual or group can be compared to the performance 
of comparable individuals or groups. 

Test norms provide check points of performance in relation 
to other pupils— they say that John scored higher (or lower) 
than other eighth graders, that he scored higher (or lower) than 
other pupils succeeding in algebra the following year. Test
norms do not tell whether John's performance was as good as
his teacher might have desired, or whether he had mastered
every concept in arithmetic to which eighth graders are exposed.
They simply tell us where John stands in relation to the
performance of "known" groups of pupils.

Similarly, a measurement on a yardstick does not, in itself, 
establish tallness or shortness. It merely establishes height. A 
ten-year-old boy whose height is five feet is considered tall, but 
an adult man whose height is five feet is considered short. In 
like fashion, a score of 16 on the XYZ Arithmetic Test may be
average for eighth graders but above average for seventh graders.
The purpose of test norms is to add meaning to individual test
scores in the same way that meaning is added to height by
judging tallness in relation to some known group of heights.

A Working Example
of the Development of Norms



We said we would assume that a good job of item writing was
done for the XYZ Arithmetic Test. We need also to assume that
more than 25 items were written originally and that the larger
group of arithmetic items had undergone a process called item
analysis. This is done by "trying" the items on a representative
group of students. Item analysis helps to determine which items
in a test are good ones. "Good" items are those which are
answered correctly by most pupils receiving high scores on the
total set of items, answered correctly by about half of the pupils
receiving average scores, and answered correctly by only a few
pupils receiving low scores. Thus, on an overall basis, good items
are missed by about as many pupils as those that get them
correct. Good items tend to separate pupils in the same way that
the total test separates them; that is, they discriminate well
between the better students and the poorer ones. Poor items are
those that are answered correctly by almost as many pupils
receiving low scores as by those receiving high scores on the
total test or set of items. Sometimes a very poor item is answered
successfully more often by poor students than by good ones;
it discriminates negatively. If one wants to be sure of having
25 good items after item analysis, it is necessary to write and try
out perhaps as many as 50 items. Once a group of items is
chosen, they are put together as a test, directions for
administration are developed, and the test is ready to be standardized or
normed.
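
The comparison of high scorers with low scorers that item analysis relies on can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular test publisher: the function name `item_discrimination`, the choice of equal-sized upper and lower groups, and the six invented pupils are all hypothetical.

```python
def item_discrimination(responses, item, group_size):
    """Proportion correct in the top group minus proportion correct in the bottom group.

    responses: one list of 0/1 item marks per pupil (hypothetical data layout).
    item: index of the item being examined.
    group_size: number of pupils in each of the upper and lower groups (an assumption).
    """
    # Rank pupils by total score on the whole set of items, best first.
    ranked = sorted(responses, key=sum, reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]
    p_upper = sum(pupil[item] for pupil in upper) / group_size
    p_lower = sum(pupil[item] for pupil in lower) / group_size
    return p_upper - p_lower


# Six invented pupils, five try-out items each (1 = correct, 0 = wrong).
responses = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
]

for item in range(3):
    print(item, item_discrimination(responses, item, group_size=2))
```

In this invented data the first item separates the two groups completely, the second is answered correctly by everyone and so separates no one, and the third is answered correctly only by the low scorers, which is the negative discrimination described above.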

Arranging the Sample

The first step in the standardization process is to define the
group (the "population") of schools or pupils for which the
test is presumably appropriate. If the XYZ Arithmetic Test
was developed so that it might be useful in any junior high
school in the United States, then we can define its population as
all junior high pupils (grades 7, 8, 9) in the United States. If
the test had been designed primarily for use in independent
junior high schools, we would define its population as all junior
high pupils enrolled in independent schools. If it had been
designed for a state testing program in Oklahoma, the
population would be all junior high pupils enrolled in Oklahoma
schools. The definition of the appropriate population for a
test is dependent upon the objectives of the author and the
projected use of his test.

Once a test population is defined, the next step is to select a
sample group of schools or pupils from the defined population.
In theory it would be ideal to select a random sample of pupils
for a nationally standardized test. In practice this is not done
because of the difficulties of securing a sample consisting of a
single pupil in some schools, two pupils in many others, three
in still other schools, and so on. It would be very difficult to
organize such a standardization even if one could afford the
enormous amount of money that it would cost. In practice, then,
school systems generally are used as samples, and attempts are
made to see that each school system has an appropriate (not
equal) probability of being chosen. For example, a junior
high school with an enrollment of 1,000 should have a greater
chance of being in the sample than one with an enrollment of
100, so that each of the 1,000 pupils would have the same chance
as each of the 100.
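
One way to see why the larger school should have the greater chance is to work the arithmetic for a single pupil. The sketch below rests on assumptions the text does not state: that the same fixed number of pupils is tested in every chosen school, that ten schools are drawn, and that the population enrolls 110,000 pupils in all. The function name and every figure except the enrollments of 1,000 and 100 are hypothetical.

```python
from fractions import Fraction


def pupil_chance(enrollment, total_enrollment, schools_drawn, pupils_tested):
    """Chance that one given pupil ends up in the norm sample."""
    # Stage 1: the school's chance of being drawn grows with its enrollment.
    school_chance = Fraction(schools_drawn * enrollment, total_enrollment)
    # Stage 2 (assumed): a fixed number of pupils is tested in each chosen school.
    chance_within_school = Fraction(pupils_tested, enrollment)
    return school_chance * chance_within_school


large = pupil_chance(1_000, 110_000, schools_drawn=10, pupils_tested=50)
small = pupil_chance(100, 110_000, schools_drawn=10, pupils_tested=50)
print(large, small)  # prints: 1/220 1/220
```

The enrollment cancels out of the product, so under these assumptions a pupil in the school of 1,000 and a pupil in the school of 100 have the same chance of being tested.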

Schools generally are classified in two ways before the random
selection is made. First, since there is considerable evidence that
average pupil performance is a bit higher in some geographic
areas than in others, most samples of school systems for test
norming are chosen so that all geographic areas are represented.
Sometimes each state is represented. The second major
classification is according to size of school system (or community). Again,
average test performance varies somewhat from size to size of
community. It is felt that large communities, middle-sized
communities, and small communities all should be sampled in a
norm group.

Putting all this together, we find that the author of our XYZ
Arithmetic Test proceeded as follows:

1. He defined his population as all junior high pupils (grades
   7, 8, 9) in the United States.

2. From census data he listed all school systems separately by
   state.

3. Within each of the 50 state groups he separated
   communities into four categories by population: (a) over
   500,000; (b) 50,000 to 500,000; (c) 5,000 to 50,000; and
   (d) below 5,000. This produced a few less than 200 groups
   (some states have no communities over 500,000).

4. Within each of the nearly 200 groups, he chose, at random,
   one or more systems, depending upon the total
   population in each group. Let's assume that he selected 300
   systems.
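
The four steps above amount to what is usually called a stratified random sample. A minimal sketch follows, assuming invented school systems and a fixed number drawn per group; the names `size_category` and `stratified_sample`, the seed, and the treatment of a community of exactly 50,000 or 500,000 (placed in category b) are assumptions, since the text leaves those boundaries open.

```python
import random
from collections import defaultdict


def size_category(population):
    """Place a community in one of the four size categories of step 3."""
    if population > 500_000:
        return "a"
    if population >= 50_000:
        return "b"
    if population >= 5_000:
        return "c"
    return "d"


def stratified_sample(systems, per_stratum=1, seed=0):
    """Group systems by (state, size category), then draw at random within each group."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    strata = defaultdict(list)
    for name, state, population in systems:
        strata[(state, size_category(population))].append(name)
    return {
        key: rng.sample(names, min(per_stratum, len(names)))
        for key, names in sorted(strata.items())
    }


# Invented school systems: (name, state, community population).
systems = [
    ("Alpha City", "Ohio", 750_000),
    ("Beta", "Ohio", 60_000),
    ("Gamma", "Ohio", 70_000),
    ("Delta", "Iowa", 3_000),
]
print(stratified_sample(systems))
```

A real standardization would draw more systems from the groups holding more pupils, as step 4 says; the fixed `per_stratum` here keeps the sketch short.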

Remember that this example is for illustrative purposes and
should not be interpreted too literally. In practice it might not
be necessary to sample every state; one might choose different
geographical divisions. In practice one might also treat certain
communities in a special way (those over 1,000,000, for example).
One could also cite examples of test standardization geared to
socio-economic level of community, or examples in which specific
school buildings (not systems) were chosen. But, regardless of
details, the sample chosen for the XYZ Test would be fairly
typical of samples chosen for tests designed to be used nationally.
Each sampling is an attempt to come as close as possible to giving
each pupil in the defined population an equal (random) chance
to be selected.

Working with the Schools

Once a norming sample has been selected, the next step is to
seek the cooperation of the chosen schools. In our example we
theoretically ended up with 300 school systems. The next thing
to be done would be to write a letter to the superintendent of
schools of each system (possibly with a carbon copy to each
building principal having grades 7 and/or 8 and/or 9). The
letter would describe the XYZ Arithmetic Test and would point up
the need for a new arithmetic test and the careful construction
of the XYZ Test. Further, it might point up the importance of
local school systems in the continuing research on new tests and
the distinction of being one of only 300 school systems chosen.
The letter would certainly offer to report all pupil scores back
to the school. It might or might not offer the tests themselves
to the participating school, and it might or might not suggest a
modest fee for participation in the norming program. In any
event, it would ask the superintendent and/or principal(s) to
agree to test their pupils with the XYZ Arithmetic Test.

Now let us assume that in response to the initial appeal, 180
(60 per cent) agree to participate by testing their pupils with
the XYZ Test. The question of adequacy of a 60 per cent sample
is the next point to consider. If one can assume that the 40 per
cent who say "No," or who don't even acknowledge the request,
are no different from the 60 per cent who say "Yes," one can
proceed to the actual testing. But, if one makes the assumption
that the 60 per cent tend to be systems that are a bit larger and
more dedicated to research, represent a somewhat higher
socio-economic level, with slightly abler students than the 40 per cent,
one is apt to be closer to the truth.

At this point several choices are open. One can proceed
directly to testing, hoping that the 60 per cent will not be too
atypical of all schools. One can try to convince the "No" schools
to say "Yes." Or one can attempt to secure the cooperation of
other school systems like the ones that responded negatively, by
choosing, at random, a substitute for each. One way or another,
attempts are generally made to move nearer to a complete
sampling. How far one carries these efforts is determined by
the test author and his publisher. Let's assume that, in the case
of the XYZ Test, substitute schools were chosen and half of the
substitutes agreed to cooperate. This would mean that 240 (180
plus 60) school systems actually agreed to test their pupils, 80
per cent of the sought-for 300.
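
The participation figures in this example follow from simple arithmetic, which the short sketch below retraces. The function name and its arguments are hypothetical; the numbers are those of the example (300 systems invited, 60 per cent agreeing, one substitute sought for each refusal, half of the substitutes agreeing).

```python
def participation(invited, first_pct, substitute_pct):
    """Return (first acceptances, substitute acceptances, total, per cent of invited)."""
    accepted = invited * first_pct // 100
    refused = invited - accepted
    # One substitute is approached for each system that refused.
    substitutes = refused * substitute_pct // 100
    total = accepted + substitutes
    return accepted, substitutes, total, total * 100 // invited


print(participation(300, first_pct=60, substitute_pct=50))
# prints: (180, 60, 240, 80)
```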

No matter how far a test author pushes his sampling
procedures, he is not likely to secure 100 per cent participation.
Perhaps the best example of participation that one can point to
is Project Talent. In that widely publicized research study
approximately 95 per cent of the school systems selected did do
the subsequent testing. Since most nationally standardized tests
do not achieve so high a proportion, one can be sure that some
error of measurement is present in each test standardization
because of imperfect sampling. (Some error of measurement is
inevitable because of the use of a sample rather than the entire
population.) Some of the variation among norms of nationally
standardized tests can be traced to the fact that no one does a
perfect job of sampling a national population. It is asking
too much of different tests to expect the resulting scores to be
exactly interchangeable.

The next step is the actual administration of the test. In this
phase many teachers and counselors are apt to be involved. In
most cases the test administration is quite like that done in a
school's on-going testing program. The test publisher establishes
approximate dates for test administration; the local school picks
an exact date. The test publisher sends sufficient test booklets,
answer sheets, special pencils (if necessary), and directions for
administration to each school; the school administers the test
according to prescribed directions and returns all answer sheets
to the publisher. The publisher eventually reports pupil scores
back to each school. (This last step is irrelevant to the test
standardization itself. It is done for public relations purposes
and to provide each school an incentive to participate.)

At this point one assumes that the test has been administered
uniformly in the various schools. One assumes that each teacher
or counselor or administrator who gave the test did follow
the printed instructions to the letter. This probably is a
reasonable assumption for most schools. Fortunately, if and
when errors do occur in the test administration, they will not
necessarily affect the norms, if part of the errors are of the type
that would raise test scores (such as not calling time exactly and
letting a group have an extra minute or two on a test) while
others are of the type that would lower test scores (such as failing
to read directions accurately so that some pupils may not know
exactly how to answer the questions). Errors of administration
would be a major factor only if they were all, or almost all,
operating in the same direction (either to raise scores generally
or to lower them generally).
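
The difference between errors that offset one another and errors that all push the same way can be shown with a toy calculation. The sketch below looks only at the average score, which is a simplification, since norms describe a whole distribution; the function name, the four scores, and the one-point errors are invented for illustration.

```python
from statistics import mean


def mean_shift(true_scores, errors):
    """How far administration errors move the average observed score."""
    observed = [score + error for score, error in zip(true_scores, errors)]
    return mean(observed) - mean(true_scores)


true_scores = [12, 15, 16, 18]

# Some groups gain a point (extra time), others lose one (directions misread).
print(mean_shift(true_scores, [1, -1, 1, -1]))

# Every group gains a point: the errors all operate in the same direction.
print(mean_shift(true_scores, [1, 1, 1, 1]))
```

With the offsetting errors the average does not move at all; with the one-directional errors it rises by the full point.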

In any event it is undoubtedly true that some variability in 
test scores can be traced to test administration. This, again, is 
a factor that may cause some variation in norms between two
tests purporting to measure the same thing.

Organizing the Scores for Interpretation

After our XYZ Arithmetic Test has been administered in the
240 school systems, let's assume that 20,000 pupils were tested
at each grade level, 7, 8, and 9. Now what happens to those
60,000 test scores? Or, more specifically, what happens to the
20,000 scores for grade 8 pupils? Let's assume that one wants
to use percentile ranks for reporting the results of the XYZ Test.
In all likelihood an electronic computer would be used to
compute the percentile ranks, and provide a set of percentile ranks
such as the one given in Table 1.

                    Table 1

              XYZ Arithmetic Test
               Norms for Grade 8

          Raw Score*        Percentile

            23-25               99
            21-22               95
            19-20               90
              18                75
              17                70
              16                60
              15                50
              14                40
              13                30
            11-12               25
             8-10               10
             6-7                 5
             0-5                 1

* Since most pupils get middle scores and few get very high or low scores,
several different raw scores sometimes yield the same percentile.

This make-believe table illustrates one simple way of
reporting test norms. Examples of interpretation would be:

1. Mary got 18 items correct, giving her a percentile rank of
   75. This means that, of the 20,000 eighth graders who
   took this test during the standardization process, three out
   of four got scores lower than Mary's.

2. Bill got 13 items correct, giving him a percentile rank of
   30. This means that, of the 20,000 eighth graders who took
   this test during the standardization process, only 3 out of
   10 got raw scores lower than Bill's.

If one is willing to assume that the pupils tested in the XYZ
Arithmetic Test standardization are typical of all eighth graders
nationally, one can extend these sample interpretations to read:

1. Mary scored higher than 75 pupils in a group of 100
   typical eighth graders.

2. Bill scored higher than 30 pupils in a group of 100 typical
   eighth graders.
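
Both halves of this process, building percentile ranks from norm-group scores and then reading a pupil's rank from the finished table, can be sketched briefly. The definition used here (the per cent of norm-group pupils scoring lower) matches the interpretations given for Mary and Bill; the function names and the four-pupil norm group are hypothetical, and the row for a raw score of 17 in the table is inferred from its neighbours because the printed table is damaged at that point.

```python
# Lowest raw score in each band of Table 1, paired with its percentile rank.
TABLE_1 = [(23, 99), (21, 95), (19, 90), (18, 75), (17, 70), (16, 60),
           (15, 50), (14, 40), (13, 30), (11, 25), (8, 10), (6, 5), (0, 1)]


def percentile_from_table(raw_score):
    """Look up a grade 8 raw score (0 to 25) in Table 1."""
    if not 0 <= raw_score <= 25:
        raise ValueError("raw score must be between 0 and 25")
    for lowest_raw, percentile in TABLE_1:
        if raw_score >= lowest_raw:
            return percentile


def percentile_rank(score, norm_scores):
    """Per cent of the norm group scoring lower than the given score."""
    lower = sum(1 for s in norm_scores if s < score)
    return 100 * lower / len(norm_scores)


print(percentile_from_table(18))  # Mary: 75
print(percentile_from_table(13))  # Bill: 30
print(percentile_rank(18, [10, 12, 15, 18]))  # tiny invented norm group: 75.0
```

A publisher's program would apply the second function to all 20,000 grade 8 scores and then round the results into a table like Table 1.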

Norms and Standards

We shall return to the question of interpreting test norms in a
later section, but now let's look again at the XYZ norms. Think
of the XYZ Test as one using multiple-choice type items with
five choices (a, b, c, d, e). A pupil not knowing the answer to a
particular question could guess at the answer, and every so often
(about one time in five) he would be likely to get an item correct
by chance alone. In a test of 25 items of that type a person
knowing absolutely nothing about eighth grade arithmetic might get
a raw score of 5 (or 4 or 6) by chance alone. Thus, in a very
real sense, the lowest "theoretical" score is 5 rather than zero.
Or one might choose to say that the effective range of scores
on the XYZ Test is from 5 to 25.*
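
The chance score of 5 is simply 25 items times one chance in five. The sketch below works that out and, as an added illustration, uses the binomial formula to estimate how often a pure guesser would land on exactly 4, 5, or 6. The binomial model assumes every item is an independent blind guess among five equally attractive choices; that assumption and the function names are not part of the original example.

```python
from math import comb


def expected_chance_score(items, choices):
    """Average raw score from blind guessing on every item."""
    return items / choices


def chance_probability(score, items, choices):
    """Binomial probability of exactly `score` lucky guesses (independent guesses assumed)."""
    p = 1 / choices
    return comb(items, score) * p**score * (1 - p) ** (items - score)


print(expected_chance_score(25, 5))  # 5.0

near_five = sum(chance_probability(s, 25, 5) for s in (4, 5, 6))
print(round(near_five, 2))
```

Under these assumptions a little over half of all pure guessers would score 4, 5, or 6, which is why the text treats 5 as the practical floor of the score range.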

All of this is leading up to another point, a point that
differentiates standardized test norms as yardsticks from the yardsticks
that teachers generally use with their own classroom tests. Notice
that a 50th percentile is achieved by any pupil getting a raw score
of 15 on the XYZ Test. But 15 is only halfway between the
lowest theoretical score and the highest (5 to 25). This says 
that an "average" pupil on the XYZ Arithmetic Test actually 
got only half of the items correct above the chance level. What 
self-respecting classroom teacher would classify as average a 
pupil getting only half of the items correct on one of his tests? 

Most classroom teachers giving classroom tests would expect
their "average" pupils to get at least 80 or 85 per cent of the
items correct, because "failure" (as defined in many schools) is
represented by anything less than 70 per cent correct response.

* In practice some pupils do get less than chance scores. This may be due to
misinformation, to a pupil's having learned wrong concepts rather than correct
ones. The chance scores apply only when the pupil makes pure guesses.

Such a yardstick, if applied to the XYZ Test, would suggest that
"passing" (minimum accepted raw score) would be 18, and
"average" would be a raw score of 21. But the typical teacher's
yardstick of pass or fail as associated with certain percentages is not
the yardstick of test norms. The yardstick of test norms is based
entirely upon actual performance of pupils and not upon any
predetermined level or levels of performance.
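
The contrast between the two yardsticks can be put in numbers. The sketch below converts the classroom percentages into raw scores on a 25-item test and, separately, shows how far above the chance level the norm-group average of 15 sits. The function names are hypothetical, and rounding a fractional pass mark up to the next whole item is an assumption.

```python
import math


def minimum_raw_score(items, percent_required):
    """Smallest whole raw score that reaches the required per cent correct."""
    return math.ceil(items * percent_required / 100)


def fraction_above_chance(raw_score, chance_score, top_score):
    """Where a score falls between the chance level and a perfect score."""
    return (raw_score - chance_score) / (top_score - chance_score)


print(minimum_raw_score(25, 70))         # classroom "passing": 18
print(21 / 25 * 100)                     # a raw score of 21 is 84 per cent correct
print(fraction_above_chance(15, 5, 25))  # norm-group average: 0.5
```

A raw score of 21 falls inside the teacher's 80 to 85 per cent band for "average," while the pupil at the 50th percentile of the norm group has answered only half of the items above the chance level.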
A Look at Some
General Characteristics
of Test Norms

When items were selected for the XYZ Arithmetic Test, they
were not selected as a teacher selects (or writes) his. The use of
item analysis procedures (see page 4) produces items that, as a
whole, are more difficult than a pupil would face in a classroom
test. In fact, the items are chosen carefully so that the average
score, the one corresponding to the 50th percentile, will be
rather close to the middle of the range of possible scores. The
author does this so that the scores on his test will be spread out
from highest possible to lowest possible. For example, suppose
that the XYZ Test had produced scores ranging only from 15
to 25 rather than from 4 to 25. It would then provide only eleven
raw score possibilities, instead of twenty-two. Spreading the
scores out allows for finer distinctions between performance of
pupils, which produces greater accuracy, which is what we mean
by test reliability.* A classroom teacher is most interested in
discovering which of his pupils have mastered a given assignment
or developed certain skills at an acceptable level. The author of
a standardized test is most interested in putting pupils in rank
order on the skill being measured.

* Reliability may be defined as the extent to which the scores on a test are
free from being influenced by errors of measurement. In our example the
reliability of the XYZ Test would be the extent to which pupils' true knowledge
of arithmetic is reflected in their test scores, as against the extent to which any
errors of administration or scoring or variations in motivation, etc., entered into
their scores.

Norms Rank Order Pupils 

This simple fact of test norms is commonplace to test
technicians. Yet it is easy to miss or forget the distinct difference
between judging pupils against a yardstick of mastery of
content and judging them against the yardstick of performance of
peer groups of pupils.

This concept applies to ability tests as well as to achievement
tests. An IQ of 100 says simply that a pupil has performed on a
general ability test at a level higher than half of a group of
typical pupils his own age. There is no absolute scale (yardstick)
of intelligence against which a pupil can be measured.

The example of norms for the XYZ Arithmetic Test was
developed using percentile ranks. It could have been developed
using T-scores or stanines or band scores or even grade
equivalent scores.* All are derived scores, and there is no one type of
derived score that must be used for a particular test.
Statistically, one could even develop an IQ scale for our XYZ Arithmetic
Test. Hopefully, no one with common sense would do it. The
choice of type of norm to be used to report the scores for a
given standardized test is up to the author and publisher. In
practice the authors of achievement tests designed for the
elementary grades generally choose to develop norms in terms of
grade equivalent scores and percentiles. Authors of intelligence
tests generally choose to use an IQ scale (though some use
percentiles or percentile bands). Authors of aptitude tests and
other secondary level tests often choose percentiles, but some
now are reporting test scores as stanines. An organization such
as the College Entrance Examination Board may choose to
develop its own score scale (from 200 to 800, with an average
score of 500). Whatever the choice is, it probably will be based
on precedents and author preference. The choice is not apt to
be dictated by statistical or normative considerations.

* T-scores: A type of derived score ranging from 20 to 80, with an average score
of 50. It generally presupposes a normal distribution of scores.

Stanine: A type of derived score ranging from 1 to 9, with an average score
of 5. It assumes a normal distribution of scores.

Band scores: A type of derived score giving a range rather than a single point,
e.g., a percentile band of 40-60 rather than a percentile of 50.

Grade equivalent score: A type of derived score based upon the average
performance of pupils enrolled in specific grades; e.g., a grade equivalent of 8.5
means test performance comparable to that of a pupil in the fifth month of the
eighth grade.
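
To make the idea of a derived score concrete, the sketch below converts one raw score into three of the scales just mentioned. It leans on conventions the booklet does not spell out: a T-score scale with a standard deviation of 10, a stanine scale with a standard deviation of 2, and a 200 to 800 scale with a standard deviation of 100. The norm-group mean of 15 and standard deviation of 4 are invented, as are the function names.

```python
import math


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norm-group mean, in standard deviations."""
    return (raw - norm_mean) / norm_sd


def t_score(z):
    """T-score: mean 50, standard deviation 10 (conventional values)."""
    return 50 + 10 * z


def stanine(z):
    """Stanine: mean 5, standard deviation 2, held within 1 to 9."""
    return max(1, min(9, math.floor(2 * z + 5.5)))


def scale_200_to_800(z):
    """A 200 to 800 scale with mean 500; standard deviation 100 is assumed."""
    return max(200, min(800, 500 + 100 * z))


z = z_score(19, norm_mean=15, norm_sd=4)  # invented norm-group figures
print(z, t_score(z), stanine(z), scale_200_to_800(z))
```

The same performance, one standard deviation above the norm-group mean, is reported as 60 on one scale, 7 on another, and 600 on a third, which illustrates why the choice among them is largely a matter of precedent and preference.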

Norms May Be National or More Limited 

The question of whether to report test norms as national
norms also is optional with a test author. Most tests used in the
schools do have, or claim to have, national norms. This is based
on custom and on a desire of the author and publisher to sell
their product in all parts of the country. A test with norms based
only on pupils from Maine is not apt to sell widely in California.
This is not because the norms would necessarily be
inappropriate, but there certainly would be an element of doubt in the
minds of the potential users. Test users tend to feel more
confidence in a test if they know that students in their own state or
region were included in the norm group.

The user of any standardized test will almost certainly be
provided with national norms. Occasionally he also may be able to
obtain regional norms from the test publisher. In some cases
he can get state norms from a state testing program operating in
his own state. If he is ambitious and has a bent toward working
with numbers, he may develop his own local norms or cooperate
with other schools in developing group norms. Whatever norm
group he chooses to use, individual pupil results must be
interpreted in relation to the population of schools in that norm
group. While the user of national norms may rightfully compare
a pupil's performance to that of other pupils his age or grade
nationally, the user of state or local norms must make his
interpretation by comparison to state or local groups.

Test Norms Are Not Absolute

By now the reader should be well aware of the relative nature 
of test norms, which grows out of relating each pupil's score to 
the scores of other pupils who took the same test. There are also
other factors which tend to make test scores relative rather than
absolute.

For example, our hypothetical XYZ Test was designed to 
sample arithmetic skills commonly taught in junior high schools 
in this country. But not all junior high schools expose pupils to 
the same arithmetic skills at the same level. Many schools are
teaching some form of "modern mathematics." To the extent
that a particular school system may be covering some aspect of
modern math over and above the traditional arithmetic skills,
its pupils should be able to perform satisfactorily on the XYZ
Test. Even in these cases, however, the pupils will not be able to
demonstrate their added proficiency in modern math on the
XYZ Test. Furthermore, some school systems with modern math
programs may not cover the traditional arithmetic skills to the
same degree that other schools do; thus, their pupils may not
perform as well on the XYZ Test as pupils from these other
schools even though they are equally good at arithmetic as a
whole. Applying the same norms to students in different schools
is justified only if they have had essentially the same
opportunities to learn what is being tested.

The relation of test norms to content applies also to
standardized tests of ability. No one has ever established an "absolute"
scale or test to measure general intelligence, that is, a test that is
not at all dependent upon what the pupil taking the test has learned. If
different children of the same age have not had similar
opportunities to learn, then differences in their scores may be due
to differences in opportunity rather than in ability.
Furthermore, authors of different intelligence tests sample somewhat
different aspects of intelligence. One test author may choose to
measure verbal ability by using verbal analogy items:

(Cold is to hot as straight is to: a. crooked  b. narrow
c. up  d. out  e. forward);

another by using regular vocabulary items; and still another
by sentence completion items:

(What goes up must come ____: a. down  b. flying
c. crashing  d. a cropper).

While each type of item may very well measure some aspect of
verbal ability, the various types do not necessarily measure
identical elements of that ability.

Test Norms Are Not Universal

Perhaps the most obvious way to illustrate this lack of
universality is to ask: Would it be feasible to administer our XYZ
Arithmetic Test to pupils in France? On straight computation
items we could expect French pupils to perform satisfactorily,
but on story problems printed in English not at all well. The
norm group for the XYZ Test was based on junior high pupils
in the United States and it is unrealistic to assume that pupils
everywhere have been exposed to exactly the same skills and
understandings and will perform exactly the same way.

Even in the domain of ability testing, where one might choose
to use some pictorial type of item:

(a figure-analogy item built from geometric shapes, with answer
choices a through e; the figures are not reproduced here)

it is unrealistic to assume that pupils everywhere have learned
geometric figures and relationships between them in exactly
the same way and at exactly the same ages.

Test Norms Are Not Permanent

Test norms are a product of the time when they are developed.
If a test is used with the same pupils at two widely separated
times, we expect their scores to change. Probably the most
obvious example of changes in norms as pupils age is that
provided by the fall and spring norms developed for most
elementary-level achievement tests. On any achievement test, the
norms for the end of a particular grade may be expected to be
higher than for the beginning of that grade.

Thus, any standardized test that is to be used at more than
one time during the age or grade development of pupils must
have separate norms for each time, or time unit, when it is to
be used. A test such as an algebra aptitude test may get by with
only one set of norms, applicable at the point just prior to
beginning formal instruction in algebra, say the spring term of
the 8th grade in most schools. But our XYZ Arithmetic Test,
designed for grades 7, 8, and 9, would need a minimum of three
sets of norms and preferably six (both first and second semester
norms for each grade).

There is another important factor which makes norms a
product of a given time period. Human knowledges and
understandings themselves change over time. For example, a pictorial
intelligence test developed in 1930 might have used a picture of
a trolley car, a refrigerator with the motor on top, or an airplane
with two wings, one above the other. All of these pictures would
have been recognized easily by boys and girls in the 1930's, but
could be puzzling for some children today. In like fashion, our
language itself changes. At one time the word Mars had its
traditional meanings when used as a vocabulary item in a test.
Now it must also be keyed correct if the answer is candy bar.

Furthermore, school programs and teaching methods change.
Particular subject matter may be introduced earlier, the school
may become more demanding, improved instructional materials
may make learning easier. Such changes can have considerable
effects upon pupil scores over a period of years.

Because of these changes in knowledges and understandings
and ways of organizing instruction, tests must be re-examined
periodically, say every five years or so. A thorough
re-examination requires two things: 1) a statistical check of each
item in a test to see whether pupils still answer it in the same
way they did previously; and 2) the development of new norms for
the entire test (or a revised version) so that any overall changes
in group scores are reflected in the norms. Item analyses generally
are not undertaken in less than an eight or ten year span and
sometimes not that often. The development of new norms may
take place more often, particularly if evidence begins to
accumulate that a set of norms is for some reason unrealistic.

In any event, it is important for a test user to ascertain the
time when the norms were established for the test he is using
and to interpret results accordingly. If the norms of our XYZ
Test were developed in 1964, and those of another similar
arithmetic test in 1954, a user might question whether results
from the two tests should be compared. The many changes taking
place in mathematics instruction in the intervening period
could have affected the two sets of norms.

Another very important consequence of the fact that test
norms are a product of a particular time relates to the question
of comparing test results taken from different tests at rather
widely separated points in time. The apparently simple question
of whether pupils could read better twenty years ago, or compute
better, or do anything else better is almost impossible to answer
exactly by using test results. Over a twenty year span of time a
particular test might well have been revised and supplied with
new norms twice or more. Thus, two different yardsticks would
be used to assess similarities or differences over such a long
period of time.

One can develop an example of this phenomenon with our
XYZ Arithmetic Test. We hypothesized a set of 1964 norms
with a percentile of 50 corresponding to a raw score of 15 for
grade 8. Suppose that in 1974 new norms were to be developed
and at that time a raw score of 17 turned out to be average. We
would then assign a percentile rank of 50 to the raw score of 17
and change all of the other percentiles appropriately. Thus,
for exactly the same set of test items, a pupil in 1974 would have
to get two more items correct than a pupil in 1964 in order to
achieve the same relative position (50th percentile) among 8th
grade pupils.

It cannot be stressed too often that test norms are yardsticks
developed at a particular time, with a particular group of pupils,
and with a particular selection of test content. To generalize
to other times, to other types of pupils, or to other content is a
questionable practice.
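
The 1964 and 1974 comparison can be made concrete with a small
sketch in Python. The two norm groups below are invented for
illustration and are not taken from Table 1. The function name
percentile_rank and the convention of counting half of the tied
scores are assumptions of this sketch, not a procedure prescribed
for the XYZ Test.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group below raw_score, counting half of any ties."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, raw_score)
    ties = bisect_right(ordered, raw_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical norm groups (invented data): the 1974 group scores
# two raw-score points higher than the 1964 group throughout.
norms_1964 = [9, 11, 12, 13, 14, 15, 15, 15, 16, 17, 18, 19, 21]
norms_1974 = [score + 2 for score in norms_1964]

print(percentile_rank(15, norms_1964))  # 50.0
print(percentile_rank(17, norms_1974))  # 50.0
print(percentile_rank(15, norms_1974))  # below 50: same raw score, lower standing
```

With these made-up groups a raw score of 15 stands at the 50th
percentile on the earlier norms but at roughly the 27th on the
later ones, although the pupil answered exactly the same number
of items correctly.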

Test Norms Assume Comparable Educational
Background for All

In a strict sense this statement is not true, for it is test authors
and test users who make this assumption rather than the norms.



If for our XYZ Test norms we had used only a thousand 8th
grade pupils, all of whom attended one selective junior high
school, the mechanical procedure of developing the norms would
not have changed. However, our purpose in doing such a thing
would have been different from our purpose in sampling all 8th
graders nationally, and users' interpretations would be vastly
different. In general, a test designed to be given nationally
includes items that sample understandings to which all or almost
all pupils have been exposed. The reason some pupils score
higher and others lower is, then, primarily dependent upon each
individual's retention of knowledge or level of development.
The difference is assumed not to be the result of differences in
opportunity to learn.

In like fashion, intelligence tests assume a common
background of opportunities to pick up information and skills, some
related rather directly to school activities, others related to
learnings acquired outside of school. The assumption is made
that the more intelligent pupil will develop his knowledge
and skills to a higher level than the less intelligent pupil, both
having been exposed to comparable educational opportunities
in and out of school.

Clearly, then, an atypical pupil or an atypical group of pupils
(atypical in terms of educational opportunities) may not be
judged "fairly" by a test which assumes equal educational
backgrounds. Some may have had very rich opportunities to learn;
others very meager ones. The differences among their scores,
or some part of the differences, may, then, be chargeable to
differences in opportunity rather than in ability.

However, before taking this statement as a condemnation of
tests or of test norms, remember what the basic purpose of test
norms is. By definition, norms are yardsticks designed to relate
individual pupil performance to the performance of "known"
groups of pupils. The known groups generally are and should
be "typical" groups. If a test is designed for widespread national
use it must be geared to the average or typical population.

To reconcile the fact that test norms are geared to the general,
typical population while some pupils or groups of pupils
are not typical calls for common sense on the part of test users.
To condemn test norms for not providing useful information
for all pupils under all conditions is unrealistic, just as it is
unrealistic to condemn a textbook in reading designed for 5th
grade pupils because there are some 5th graders who cannot
grasp the content while some others are far ahead of the text.

How might all of this apply to the XYZ Arithmetic Test? We
hypothesized that its norms were designed for use with all 7th,
8th, and 9th graders nationally. Thus, it was designed to measure
arithmetic knowledges and understandings common to typical
junior high pupils in the United States. But one can see how
judiciously its results must be interpreted if we look at two
examples:

1. John's father is an engineer who works constantly with
figures. He has a desk calculator in his home and has taught
John how to use it. In fact, John does some of his father's
calculations for him and receives a small stipend for his work. John's
teachers know of his work with figures and have encouraged
and utilized his arithmetic skills. John scored at the 95th
percentile on the XYZ Test while in the 8th grade.

The only fact shown by the test results is that John scored
higher than 95 per cent of other 8th graders; the reasons why he
scored that way are speculative. Perhaps he is bright, perhaps he
studies hard, perhaps his added practice at home increased his
score, perhaps all of these factors are present.

2. Tom's father is an itinerant field worker whose formal
education stopped at grade 6. Sometimes his family is with him,
sometimes it is not. Tom often starts school in September in
one community, but when his family moves in October he is out
of school for a few weeks and then enters a different one. Only
the simplest arithmetic, mostly making change, is used in his
home. Tom learns and practices arithmetic skills only in school.
Tom scored at the 25th percentile on the XYZ Test in grade 8.
What does this mean?

Perhaps he really has less than average ability to develop his
arithmetic skills, perhaps he is not highly motivated to do
school work, perhaps his family background is a handicap he
has not been able to overcome. If Tom had been reared in a
typical home with typical educational advantages he might have
scored even higher on the XYZ Test than John did.
Nevertheless, it is a fact that, when compared with an average group
of 8th graders on this arithmetic test, his score was surpassed by
75 per cent of them. Again, common sense must be applied to
determine the why or whys.

Test norms establish relative position within a group. They
do not establish the reasons for that position. A pupil with an
atypical educational or sociological background may score higher
or lower on a test than he would have under typical conditions.

Test Norms Are More Apt to Reflect Typical Performance
Than Maximum Performance

There is a notion about pupil performance on standardized
tests that says, "You can't score higher than your actual ability
(achievement) level." Such a statement seems so obvious that
few would question it. Actually it should be questioned, for at
least two reasons: 1) the choice-type of test item does allow a
pupil to receive a higher test score by chance alone than his
theoretical "true" score; and 2) the process of developing test
norms suggests that typical performance rather than maximum
performance is sampled.

If one remembers that most standardized tests now use
multiple-choice items, it follows that most pupils will guess on
some or all of those test items for which they do not really know
the answer. Most test authors no longer use a correction-for-guessing
formula. They are more apt to suggest that all pupils
attempt to answer all items. This is an invitation to guess when
the answer is unknown. For most pupils guessing does not alter
their relative performance appreciably. This is because, if
everyone guesses, all raw scores are raised slightly, and the rank
order of the scores remains essentially the same. (Remember that
the major purpose of test norms is to determine rank order.)
Occasionally, however, a pupil may suffer or gain by guessing
very poorly or very successfully. Thus, occasionally, by chance
alone, a pupil may get a lower score (relatively) than his real
ability or knowledge warrants. And occasionally, by chance
alone, a pupil may get a higher score (relatively) than his real
ability or knowledge warrants.
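
The arithmetic behind these statements can be sketched briefly.
The sketch assumes a 25-item test with five answer choices per
item, pupils who guess blindly on every item they do not know,
and the conventional correction formula (rights minus wrongs
divided by one less than the number of choices). None of these
figures is specified for the XYZ Test, and the function names are
illustrative only.

```python
def expected_score_with_guessing(known, total_items, options):
    """Expected raw score when every unknown item is answered by a blind guess."""
    return known + (total_items - known) / options


def corrected_score(right, wrong, options):
    """Conventional correction for guessing: rights minus wrongs / (options - 1)."""
    return right - wrong / (options - 1)


TOTAL_ITEMS = 25  # assumed test length
OPTIONS = 5       # assumed number of answer choices per item

# Three hypothetical pupils who really know 10, 15, and 20 answers.
known_counts = [10, 15, 20]
expected = [
    expected_score_with_guessing(known, TOTAL_ITEMS, OPTIONS)
    for known in known_counts
]
print(expected)  # [13.0, 17.0, 21.0]: every score rises, the order is unchanged

# Applying the correction to the first pupil's expected result
# (13 right, 12 wrong) returns the 10 items he actually knew.
print(corrected_score(right=13, wrong=12, options=OPTIONS))  # 10.0
```

These are expected values only. Any one pupil's luck may run
above or below them, which is the occasional gain or loss by
guessing that the paragraph above describes.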

The second point, relating to typical or maximum performance,
probably is much more important than the first. There is
a considerable body of evidence that demonstrates the importance
of pupil motivation in test taking. Artificial means, such
as rewarding testees with small cash payments, have been shown
to raise group test results above those obtained under normal,
routine testing conditions. Probably every reader can think of
examples of pupils who "don't care" when taking tests and who
seem to achieve lower test scores than their other behavior would
suggest. And there is always the eager pupil who considers every
test a personal challenge to get a high score. Thus, one can think
of situations in which pupils may perform either better or worse
than their own typical performance.

Think again of the standardization of the XYZ Test. It was
given to 60,000 pupils in 240 different schools. It was handled
by teachers, administrators, and counselors, undoubtedly
with some variation in adequacy of administration. The pupils
in this hypothetical case would have been told to do their best
work, but that the test scores would not be used to grade them
or in any other way be used for selection, or promotion, or
honors, or other classifications. Some of the pupils in the 60,000
undoubtedly would take the testing very seriously and would
work very hard at it; others would take a so-what attitude and
just go through the motions, doing the easy items and ignoring
or guessing at the difficult ones. But the large majority of the
60,000 probably would approach the XYZ Test with an average
or typical amount of motivation: they would do their best,
within limits. While this is speculative, it seems likely that the
highly motivated and poorly motivated would tend to even each
other out and the large number of pupils with average motivation
would tend to dominate the scores in the norm group.

All of this rationale is designed to demonstrate once more the
statement that test norms represent typical performance rather
than maximum performance. Most pupils, under conditions of
strong motivation, can perform or achieve at a level above their
day-to-day performance. Test norms are geared more to their
day-to-day operating level of efficiency than to their maximum
level.

Factors Affecting a Choice
Between Specific, Well-Defined
Norms and General Norms



Most of the examples used so far have involved national
norms, although regional, state, and local norms have been
mentioned. The only real difference between these various types
of norms is the geographical definition of the population that
is to be sampled. If there are differences between the national
norms and the state norms for a given test, they are the result
of achievement or ability differences of the pupils in that state
when related to the other 49 states. However, there are times
when it is desirable to develop and use norms based on or
related to characteristics that are not dependent upon geography.

1. Regional differences. Before discussing other types of
norms it might be well to mention a characteristic of national,
regional, and state norms that has become clear in the past
several years. There is considerable evidence to indicate
rather consistent differences in group test performance between
three large geographical areas of the country. Consider Area I
as New England and the Middle Atlantic states (including
Maryland on the south, and Pennsylvania and New York on the west).
Consider Area II as the Southeastern states (including all the
so-called "border" states on the north, and going as far west as
the Mississippi River). Consider Area III as all other states:
Midwest, Rocky Mountain, Southwest, and Far West.

Having defined these areas one can say that group test results
on achievement and ability tests tend to be highest in Area I,
lowest in Area II, and in the middle in Area III. These group
differences are large enough to appear to be significant
(non-chance). However, there is much more overlap than there is
difference. Recent evidence for this statement comes from
Project Talent* results and also from the selection scores for
various states in the National Merit Scholarship competition.**
Information from both sources shows the same regional
differences. Project Talent results are from a wide-range series of
general information tests, and data were gathered from a sample
of all pupils enrolled in the 9th, 10th, 11th, and 12th grades in
the entire country. National Merit selection scores are for a
test of general educational development and apply only to the
top one per cent of the pupil population. The reason or reasons
for these differences are not easily determined; the reader must
decide for himself why they appear.

Lest the reader be left with the impression that regional
differences in norms are overwhelming, it should be mentioned that
there are other differences between the achievement levels of
various schools that are much more dramatic than regional
differences. Project Talent results, for example, demonstrate
that achievement in schools located in high socio-economic level
areas is more like achievement in schools in similar
socio-economic areas all over the country than it is like achievement
in other schools in the same geographic area. The Project Talent
staff developed a classification of schools according to the average
achievement level of their pupils. The classification is based on
geography, socio-economic level, and size of community.

* Dailey, John T. "A System for Classifying Public Schools"
(Project Talent Results of Initial Analyses). A paper presented
at AERA and AASA Meeting, Atlantic City, New Jersey, February 1962.

** Guide to the National Merit Scholarship Corporation, Evanston,
Illinois, August 196[?], pp. 15[?]14.

2. Sex differences. Many standardized tests (most general
intelligence tests, achievement and skills batteries) make no
differentiation in norms by sex. A few have developed separate norm
tables for boys and girls. There is nothing particularly
mysterious about sex differences in test norms. In general, girls tend
to get somewhat higher scores on tests that are verbal in nature,
that depend in large part upon command of the language. In
general, boys tend to get somewhat higher scores on tests that
are numerical or mechanical in nature. When separate norms
are used for boys and girls, the person interpreting test results
must make a mental note that he is comparing a pupil's score
with scores of boys or girls only, as the case may be. When
separate norms are not developed, there still may be some sex
differences, but either they are small or the test author feels that
it is best to compare all pupils on the same norms regardless of
the difference.

3. Types of school. The independent schools (and certain
public schools) associated with the Educational Records Bureau
have felt a need for their own separate norms. These schools
are selective in nature, tending to enroll pupils of above average
academic ability. They have relatively homogeneous student
bodies with similar abilities and similar goals. Since these
student bodies are not typical of most public schools, many of the
independent schools prefer to compare the performance of their
pupils with that of pupils in other like schools. The Educational
Records Bureau develops independent school norms for the
tests used by its member schools. Many test specialists believe
that such specialized test norms as these have greater utility than
general norms. Both types, no doubt, have their place.

4. CEEB norms. The College Entrance Examination Board
tests illustrate still another type of norm. The CEEB norms
(mean of 500, range from 200 to 800) were based on the
performance of those college applicants who took the Board tests
in 1941. Subsequent Board tests have been equated back to the
1941 group. The reader may well question the adequacy of
norms which go back more than 20 years. There certainly is
reason to assume that college-bound students now are not the
same as college-bound students in 1941. However, the practice
can be justified simply by remembering that norms may be
thought of as a yardstick. Anthropologists tell us that children
are taller now than they were a generation ago. Yet we measure
them with the same yardstick. In like fashion, college-bound
students in 1964 may be more or less able as a group than those
taking the Boards in 1941. Yet, within any year's group of
college-bound students, they will have the same relative position
in relation to each other regardless of the norms used. They
may or may not have the same mean of 500, but that is not
particularly important. Another way to look at this point is to
realize that the actual norms of the College Board tests are not
as important as the range of scores and average score of a
particular college to which an applicant is applying. For example, if
Jim has a V (verbal) score of 480 and an M (mathematical)
score of 540, it does not really matter that one is slightly above
500 and one slightly below. What really matters is that Jim has
applied for admission to two different colleges. Ivy College has
a student body with Board scores running from 350 to 650 and
with a mean of 450; State College has a student body with Board
scores running from 450 to 750 with a mean of 600. Jim might
be accepted by either school, by one, or by neither. If accepted
by both, he could choose to enter Ivy, where he would be slightly
above average in general ability; or State, where he would be
slightly below average. In either case, the important comparison
is with the norms of the colleges to which Jim applied, not the
1941 yardstick that was used to get the measure of ability.

5. School norms vs. pupil norms. One of the most puzzling
aspects of test norms to many people is the difference between
school norms and pupil norms. Up to this point this brochure
has been concerned with pupil norms. Looking back at the
XYZ Arithmetic Test, we hypothesized that 300 school systems
were chosen in the sample but only 240 systems participated.
The norms presented in Table 1 were based on the imaginary
sample of 20,000 8th graders, with all 20,000 scores put together
in one distribution. That process produces pupil norms.

Now let us suppose that we had done something else. For
each of the 240 school systems that participated in the norming
of the XYZ Test it would be possible to obtain a single mean
(average) score. Thus, we could get 240 mean scores. The
highest mean score for the school with the most able student
body would not be 25 or 24 or 23 or any other score so close to
perfection. It might be that some one or two schools would get
mean scores as high as, say, 19, and possibly some others as low
as 11. However, most of the 240 mean scores would fall at about
14, 15, or 16. There would, in fact, be a distribution of scores,
240 of them, forming a fairly normal distribution with 15 in all
probability as the middle of the distribution. Such a
distribution of scores would serve as a basis for school norms rather than
pupil norms. Figure 1 shows the difference.

Figure 1
XYZ Arithmetic Test
Pupil Norms and School Norms

[Figure not reproduced: pupil norms and school norms shown on
percentile scales marked 1, 5, 10, 25, 50, 75, 90, 95, and 99.]

Now let's look at two specific examples:

Horace Mann Community Schools: mean XYZ score for all
pupils, 17

Pestalozzi Area Schools: mean XYZ score for all pupils, 12

In the case of the Horace Mann schools an average raw score
of 17 would be a school norm of the 90th percentile. That is to
say, the pupils in the Horace Mann schools, as a group, achieved
at an average level higher than 9 out of 10 of the 240 school
systems in the XYZ norm group. But that does not say that the
average or typical pupil in the Horace Mann schools scored
above 9 out of 10 pupils elsewhere. Actually the average pupil
at Horace Mann scored at the 70th percentile on pupil norms
(see Figure 1), or above 70 per cent of a group of typical pupils
elsewhere.

In the case of the Pestalozzi schools the average raw score of
12 can rightfully be interpreted as a school percentile of 5. Only
5 per cent of a group of 100 typical American schools could be
expected to achieve average scores below the Pestalozzi schools
(if we consider our sample of 240 schools as typical). However,
the average pupil at Pestalozzi scored above 25 per cent of a
group of typical pupils and below 75 per cent (see Figure 1).

Both school norms and pupil norms serve useful purposes.
But the purposes are different and should not be confused, one
with the other.
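
The difference between the two kinds of norms can be seen in a
small sketch. The five schools and their scores below are
invented and far smaller than the 240 systems of the XYZ norm
group, so the percentiles they yield differ from those quoted
for Horace Mann and Pestalozzi. The function percentile_rank,
which counts half of any tied scores, is an assumption of this
sketch.

```python
from bisect import bisect_left, bisect_right
from statistics import mean


def percentile_rank(score, group):
    """Percent of the group below the score, counting half of any ties."""
    ordered = sorted(group)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


# Hypothetical raw scores for five small schools (invented data).
schools = {
    "A": [8, 11, 12, 13, 16],
    "B": [10, 13, 14, 15, 18],
    "C": [11, 14, 15, 16, 19],
    "D": [12, 15, 16, 17, 20],
    "E": [13, 16, 17, 18, 21],
}

# School norms rest on one mean per school; pupil norms pool every score.
school_means = [mean(scores) for scores in schools.values()]
all_pupils = [score for scores in schools.values() for score in scores]

for average in (17, 12):
    on_school_norms = percentile_rank(average, school_means)
    on_pupil_norms = percentile_rank(average, all_pupils)
    print(average, on_school_norms, on_pupil_norms)
# 17 90.0 76.0
# 12 10.0 20.0
```

Because school means bunch together more tightly than individual
scores, the same raw score of 17 stands further from the middle
on school norms than on pupil norms, and so does the raw score
of 12 at the other end.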

6. Comparable forms. Many standardized tests are published
with more than one form. The different forms may be referred
to as comparable forms, parallel forms, or equivalent forms.
Each form of a test is designed to be as nearly like the other
form(s) as it is possible to make it through careful item selection.
If all forms of a test were exactly alike, if they correlated +1.00
and had the same mean and standard deviation, then one could
use the same norms for all forms and expect them to be entirely
satisfactory. However, to the extent that the correlation between
two forms of a test is less than +1.00, and to the extent that the
means and standard deviations vary a bit from form to form,
a question arises as to the appropriateness of using identical
norm tables. Of course, if the correlation is +.99 and the means
and standard deviations vary by only a tenth of a point, one
certainly would not question the use of the same norms. But if
the correlation between two forms is only +.80, with means and
standard deviations varying by several points, one would
certainly feel that each form should have its own separate norms.
Some place there is a breaking point beyond which it is not
appropriate to use the same norms for two forms of a test. That
breaking point cannot be flatly specified. In practice some test
authors supply differentiated norms when the differences are
very small; other authors will tolerate larger differences under
the same norms.
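
The three statistics named above (correlation, mean, and
standard deviation) can be computed for two forms as in the
sketch below. The scores are invented and stand for the same
eight pupils taking a hypothetical Form A and Form B. The
function names are illustrative, and nothing here fixes where
the breaking point lies.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)


def compare_forms(form_a, form_b):
    """Summarize how closely two forms agree for the same pupils."""
    return {
        "r": pearson_r(form_a, form_b),
        "mean_diff": mean(form_b) - mean(form_a),
        "sd_diff": stdev(form_b) - stdev(form_a),
    }


# Hypothetical scores of the same eight pupils on two forms (invented data).
form_a = [12, 15, 9, 20, 17, 14, 11, 18]
form_b = [13, 14, 11, 19, 18, 13, 12, 17]

summary = compare_forms(form_a, form_b)
print(summary)

# A form that is uniformly three points easier correlates perfectly
# with Form A yet would still mislead if scored on Form A's norms.
easier_form = [score + 3 for score in form_a]
print(compare_forms(form_a, easier_form))
```

The second comparison shows why the correlation alone does not
settle the matter: the means and standard deviations must agree
as well before one norm table can serve both forms.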

When a test provides different norm tables for Forms A and B,
the authors are saying, in effect, "In spite of the fact that we
attempted to build equivalent forms there still is enough
difference between the two so that it is essential to provide separate
norms." When another test provides one set of norms for Forms
X and Y, the authors are saying, in effect, "The two forms are
so close in their statistical properties that it is not worth while to
provide separate norms." The reader should realize that the fact
that different authors take these different points of view does not
necessarily mean that Forms X and Y actually are closer to
each other than Forms A and B. It may just mean that the
different authors have used different standards of statistical rigor
in making their decisions. For example, the comparable form
reliability between A and B may be .85 and between X and Y
also .85. Then the author of Forms A and B has said, "A
correlation of .85 is not high enough to warrant the use of the same
norms," while the author of Forms X and Y has said, "A
correlation of .85 is high enough to justify the use of the same norm
tables."

All of this leaves the consumer (the teacher, counselor,
administrator) in a somewhat awkward position. Unless he plans
on developing his own local norms for every form of a
standardized test that he uses, he must accept the decision of "same" or
"separate" norms that the test author has made in developing
the test. He can keep in mind, however, that the development
and use of separate norms for each form of a test is, statistically,
the conservative approach. It makes no assumption of complete
or almost complete identity of the two or more forms of a test.
Rather it treats each form independently and establishes norms
for each. Thus, it is the more accurate of the two approaches.

7. Selection scores decrease the need for test norms. In
some situations the selection or "cut-off" score is more important
than the norms. Thus, in a given college, the norms of the
CEEB tests may be relatively unimportant because the college
has decided upon the range of scores it will accept.

We can see the same thing in a simple example. Suppose a
school has used an algebra aptitude test for a number of years.
The administrator probably has a pretty good idea as to the
meaning of certain raw scores, without reference to any norm
table. For example, he may have noticed that not a single
student with a raw test score below 20 has ever passed algebra in
his school, and that no one with a raw score below 23 has ever
got a C or better. It does not really matter to him whether a
raw score of 20 is equivalent to a national percentile of 5 or
25 or even 50 or 75. What matters is that he knows what certain
scores imply for success or failure in algebra as it is taught in his
school by his teachers to his pupils.
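
An administrator who keeps such records could read the cut-off
scores directly from them, with no norm table involved. The
sketch below assumes a hypothetical list of past pupils, each
with an aptitude raw score and a final algebra grade. The
records are invented so as to agree with the figures of 20 and
23 in the example above, and the function name is illustrative.

```python
# Hypothetical local records: (aptitude raw score, final algebra grade).
records = [
    (17, "F"), (19, "F"), (20, "D"), (21, "F"), (22, "D"),
    (23, "C"), (25, "D"), (26, "B"), (28, "C"), (31, "A"),
]

PASSING = {"A", "B", "C", "D"}
C_OR_BETTER = {"A", "B", "C"}


def lowest_score_with(records, acceptable_grades):
    """Lowest aptitude score among pupils whose grade is acceptable."""
    return min(score for score, grade in records if grade in acceptable_grades)


lowest_pass = lowest_score_with(records, PASSING)
lowest_c_or_better = lowest_score_with(records, C_OR_BETTER)
print(lowest_pass, lowest_c_or_better)  # 20 23
```

The two figures describe this school's own experience only.
They say nothing about where a raw score of 20 or 23 falls on
any national percentile scale.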

Another type of situation in which norms are not very
important occurs when a limited number of pupils are to be
selected out of a group. Suppose a high school is instituting a
new Advanced Placement course in physics. It has 30 pupils
who request enrollment in the course, but enrollment is to be
limited to 15. If ability test results are to be used as a part of
the selection process, it does not matter much what the actual
IQ's (or percentile ranks) are. It probably is more important
simply to put pupils in rank order in terms of ability test scores.
For that purpose, raw scores would work just as well as IQ's
or percentiles.

There are not many school situations in which selection scores
are all-important and norms of no importance. Generally,
school administrators and counselors are concerned with
multiple criteria for selection. They wisely include past grades,
teachers' recommendations, and evidence of motivation as well
as test scores, so that any one test score is just another bit of
evidence. Nevertheless, in some school situations the
relationships of test scores to demonstrated success or failure are more
important than test norms.

Avoiding Confusion with Summary Statistics 

Occasionally the word norm is given a completely different
meaning than it generally carries, being used to refer to an
average score (to a mean or to a median). When a principal
asks his director of testing, "What is our norm on the XYZ
Test?" he probably wants to know the school average, not a
listing of raw scores or percentiles such as that in Table 1. This
adds a meaning to the word norm that it would be better to
ascribe only to mean or median. Since there are precisely defined
words to use for the different measures of central tendency, there
is no reason to add another term. Even the word average is
more acceptable to describe a measure of central tendency than
the word norm. Still, it is wise to keep in mind that this other,
additional meaning is used rather widely. The educator who
wants to appear knowledgeable in the test domain should avoid
using the word norm when he wants to refer to an average score,
just as he should avoid using the word correlation when he is
speaking simply of a relationship between two things.

A Special Question: Whether or Not
to Develop Local Norms



As has been mentioned previously, the only basic difference
between local norms and national norms is in the defined
population. Statistically there is no difference in the way one
computes local norms. However, in the total process of developing
norms there are some steps that may be left out when developing
local norms. In all probability local norms will be developed
after a test is already in use in a school system. Therefore there
are not going to be any sampling problems. If the XYZ
Arithmetic Test is adopted by a particular school system, all 8th
graders, not just a sample of 8th graders, will be given the test.

Characteristics of Local Norms 

In a middle-sized system it is often advisable to use the
scores of all pupils taking the test in a particular year to
establish local norms. In very small systems, scores may be com- 
bined over several years to get the norms. In larger systems it is 
not necessary to use all the test scores. Thus, one might use only 
one-half or one-fourth or one-tenth, or some other appropriate 
fraction, making sure that the portion used is selected randomly, 
and that at least several hundred scores are used. 



There are two special points to keep in mind when deciding 
whether to develop local test norms. They are: 1) local test
norms do not change the rank order position of any pupils from
their positions on national norms; and 2) local test norms simply
move a pupil's score up or down (or leave it unchanged) from
national norms. 

The position, or rank order, of any pupil's score is determined
by his raw score, not by a norm. A normative score is simply a
more convenient score for expressing that ordinal position. If
Joe got 20 items correct on the XYZ Test and Steve got only 14,
Joe will have a higher percentile (or stanine or grade equivalent,
etc.) on national norms or regional norms or state norms or local
norms or building norms, or any other norms one cares to
develop. When put this way the statement is so obvious that one
may question the need for even saying it. However, too many
users of tests feel that a local norm somehow provides different
information about a pupil than a national norm. Actually local
norms provide the same information, but they relate that
information to a different group.
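
For the reader who likes to check such a statement by computation, a minimal sketch in Python follows. The two norm groups are invented for illustration and are not taken from the XYZ Test; the percentile-rank rule used here (percent of the group scoring below, with ties counted as half) is one common convention and is an assumption of the sketch.

```python
def percentile_rank(raw_score, norm_group):
    """Percent of norm_group scoring below raw_score, counting ties as half."""
    below = sum(1 for s in norm_group if s < raw_score)
    ties = sum(1 for s in norm_group if s == raw_score)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)

# Hypothetical raw scores for two norm groups (invented for illustration).
national_group = [6, 8, 10, 12, 13, 14, 15, 16, 17, 18, 20, 22]
local_group = [12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

joe_raw, steve_raw = 20, 14  # raw scores used in the text

results = {}
for label, group in [("national", national_group), ("local", local_group)]:
    # Each entry holds (Joe's percentile rank, Steve's percentile rank).
    results[label] = (percentile_rank(joe_raw, group),
                      percentile_rank(steve_raw, group))
    print(label, results[label])
```

The percentile ranks differ from one group to the other, but Joe stays ahead of Steve on both, because the ordering comes from the raw scores and not from the norm table.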

That brings us to the second point. If we refer back to the 
idea of test norms as yardsticks, we can say that local norms 
provide us with a yardstick that has diflFerent units (or numbers) 
on it than those on the national norm yardstick. In practice,
then, this boils down to the following relationships between
national and local norms:

1. If a local school system has a student body that tends to 
score above national norms, its own local norms will lower 
almost everyone's derived score below what it would be on 
the national norms.

2. If a local school system has a student body that tends to
score close to national norms, its own local norms will
not be materially different from national norms.

3. If a local school system has a student body that tends to
score below national norms, its own local norms will raise
almost everyone's derived score above what it would be on
the national norms.

Table 2, based on the XYZ Test, illustrates the differences in
national and local norms for the situations described.

Table 2
XYZ Arithmetic Test
National and Local Norms

Raw Score    National Norm    Case 1    Case 2    Case 3

Think of a pupil, Jane, who got a raw score of 15 on the XYZ
Test. On national norms she would receive a percentile rank of
50. If she attended a school system with an above average student
body (Case 1), her percentile rank on local norms might be
about 35 or 40. If she attended a school system with an average
student body (Case 2), her percentile rank on local norms would
be close to 50. If she attended a school system with a below
average student body (Case 3), her percentile rank on local
norms might be about 65 or 70.

To interpret these situations one would say, in Case 1, Jane 
scored above half of a typical group of 8th graders nationally 
on the XYZ Test, but scored above only 35 per cent of the 8th 
graders in her own school system. For Case 3 one would say, 
Jane scored above only half of a typical group of 8th graders 
nationally on the XYZ Test, but above 65 per cent of the 8th 
graders in her own school system.

In neither case did Jane's test performance change. She was
average on a national scale. The only things that changed were
the two different groups (yardsticks) against which her score was
compared on a local basis.
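
Jane's situation can be sketched in a few lines of Python. The score lists below are hypothetical: the national group is invented so that a raw score of 15 falls at the 50th percentile, and the Case 1 and Case 3 groups are simply that group shifted up 2 points and down 3 points. The percentile-rank rule (ties counted as half) is likewise an assumption of the sketch, not the publisher's procedure.

```python
def percentile_rank(raw_score, norm_group):
    """Percent of norm_group scoring below raw_score, counting ties as half."""
    below = sum(1 for s in norm_group if s < raw_score)
    ties = sum(1 for s in norm_group if s == raw_score)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)

# Hypothetical national norm group of 20 raw scores (invented).
national = [5, 7, 8, 9, 10, 11, 12, 13, 14, 15,
            15, 16, 17, 18, 19, 20, 21, 22, 23, 25]

norm_groups = {
    "national": national,
    "case 1": [s + 2 for s in national],  # above average student body
    "case 2": list(national),             # average student body
    "case 3": [s - 3 for s in national],  # below average student body
}

jane_raw = 15  # Jane's raw score never changes
jane_ranks = {name: percentile_rank(jane_raw, group)
              for name, group in norm_groups.items()}
print(jane_ranks)
```

With these invented groups Jane's one raw score of 15 becomes a percentile rank of 50 nationally, 37.5 in Case 1, 50 in Case 2, and 67.5 in Case 3, which is the pattern described above.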

Pros and Cons of Local Norms

There are differences of opinion as to the value of local norms.
Proponents feel that they are more useful to a school system than
national norms; detractors feel that little is gained by the efforts
necessary to develop them.

Perhaps the most important point made by proponents of 
their use is that local norms provide a fairer assessment of local 
competition. That is, if Jane is in a school such as Case 1, it is 
better to know that she is a bit below average in relation to her
classmates (35th percentile) in arithmetic than to know that
she is average in relation to a national group. Or, if she is in
a Case 3 school system, it is better to know that she is above
average in arithmetic (65th percentile) in relation to her
classmates.

Opponents of the use of local norms might say that, while it is 
well and good to compare Jane to her classmates, she may 
not always be in that school system, and that it is better to think 
of her arithmetic achievement in relation to pupils all over the 
country. They might say that it is best to think of Jane as having 
average proficiency in arithmetic, not below average just because 
her classmates happen to be particularly able in arithmetic, or
above average just because her classmates are low in arithmetic 
achievement. 

Fortunately this question of the greater value of local or na- 
tional norms is one that does not have to be resolved. There is 
no reason why a school system cannot use both types of norms 
(as well as any others that are appropriate) and thus gain the 
advantages of both of them. 

As far as Jane's junior high teachers are concerned, it may be
of more interest to know where Jane stands in relation to her
classmates than in relation to a national sample. Considering
the day-to-day competition that Jane faces in her classes, this is a
reasonable view. However, Jane's counselor, while he certainly
will be concerned with her achievement within her own school,
must also look beyond the local situation to the potential
competition Jane may face in other schools or after her school
career is ended. For example, Jane might be enrolled in a junior
high school with a below average student body (so she appears
above average) but be headed toward a senior high school with an
average or even above average student body. In such a case the
counselor could use national norms as a base (or yardstick) that
will not change as Jane's school changes. Of course he also could
use both the junior high and senior high norms to complete
the picture of present and future competition.



Special Considerations 

Before leaving the topic of local norms there are several
important points to be made. The first is simply a plea for common
sense when deciding whether to develop local norms or not.
There is nothing in the process itself that would preclude a
school system from developing local test norms for a group
intelligence test. But the very idea of intelligence as a general
mental ability is foreign to the idea of establishing separate IQ
tables for each school system. The idea is so foreign that the
author knows of only a few systems that have ever done it, and
they were very large systems which felt they could demonstrate
that their local norms were comparable to national norms
anyway. This extreme, and perhaps ridiculous, example is given to
emphasize that while local norms have a place, that place does
not encompass all types of tests.

The best case for developing and using local norms can be
made with respect to achievement tests. The goals or purposes
of achievement tests are closely related to the specific goals of a
school system, and the day-to-day operation of most classrooms
is geared to increasing the knowledges, skills, and understandings
of the pupils in the system. It makes sense to consider the use of
local norms with achievement tests; it does not make sense with
intelligence tests or any other test seeking to measure some
aspect of human personality not closely related to school systems
individually.

The second point to be made here is that any school using
local norms on a test is and always will be "average" on those
norms. It is impossible for a school to be above average or below
average on its own local norms. The process of developing local
norms automatically assigns the middle raw score to a percentile
of 50 or a stanine of 5 or a T-score of 50, the middle score of
whatever type of derived score is being used. When a school
person speaks of a system's being above or below average, he
must be relating local statistics to national or regional or state
norms. Again, one should not confuse local norms with
summary statistics.
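
The second point can be verified with a short sketch. The two schools below are hypothetical (the "strong" school is just the "weak" one with every raw score raised 10 points), and the tie-as-half percentile rule is an assumed convention.

```python
from statistics import median

def percentile_rank(raw_score, norm_group):
    """Percent of norm_group scoring below raw_score, counting ties as half."""
    below = sum(1 for s in norm_group if s < raw_score)
    ties = sum(1 for s in norm_group if s == raw_score)
    return 100.0 * (below + 0.5 * ties) / len(norm_group)

# Hypothetical raw scores for two schools (invented for illustration).
weak_school = [9, 11, 12, 14, 15, 17, 18, 20, 21, 23, 26]
strong_school = [s + 10 for s in weak_school]

local_standing = {}
for label, scores in [("weak", weak_school), ("strong", strong_school)]:
    middle_raw_score = median(scores)
    # Look up the school's own middle score in the school's own norms.
    local_standing[label] = percentile_rank(middle_raw_score, scores)

print(local_standing)
```

Both schools land at a local percentile of 50, even though one scores 10 raw points higher than the other. Only an outside yardstick can show that difference.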

Salient Considerations
in the Interpretation of Norms



Several years ago the author was approached by an elementary
school principal who wanted some help with the standardized
testing program in his school. In the course of the conversation
the principal expressed concern over the reading level of his
5th grade pupils, because, "Forty percent of the fifth graders are
reading below grade level on a standardized test of reading." He
was rather nonplussed when the author congratulated him on
the apparently good job of reading instruction going on in his
school (a school with average, not above average pupils). If the
reader, also, is puzzled let him think back over the meaning of
national norms and the way they are developed. On national
norms half (50 per cent) of all pupils are reading at or below
grade level. By their very nature, norms automatically pick the
middle score of a distribution and define that point as average,
as grade level, as 50th percentile, or as the midpoint of
whatever scale is being used. Thus, a school with only 40 per cent of
its pupils below grade level (rather than 50 per cent) is above
average (assuming that most of the other 60 per cent are above
grade level).

In this example the principal made the very common error of
translating test norms into standards of achievement. He had
assumed that all or almost all pupils should be reading at grade
level. But grade level on test norms is simply a point (or score)
dividing all the pupils in a grade into halves. One may or may
not like the way test norms are defined and developed, but it is
hardly proper to interpret them as if they were something that
they are not. Suggestions have been made from time to time
that we need standards of achievement in such skill areas as
reading, arithmetic, and language, so that we could compare
individual or group achievement with those standards. Such
standards might be very useful, but attempting to change norms
into such standards is an impossible task.
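
The principal's figures can be checked with a small sketch. The national sample and the school's scores are invented for illustration, and "grade level" is taken here to be the median of the national sample, which is the sense in which the text uses the term.

```python
from statistics import median

def share_below(scores, cutoff):
    """Percent of scores that fall strictly below the cutoff."""
    return 100.0 * sum(1 for s in scores if s < cutoff) / len(scores)

# Hypothetical national norm sample of reading raw scores (invented).
national_sample = list(range(1, 21))
grade_level = median(national_sample)  # the middle of the norm sample

# Hypothetical school in which 4 of 10 pupils score below grade level.
school = [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

national_share = share_below(national_sample, grade_level)
school_share = share_below(school, grade_level)
print(national_share, school_share)
```

Half of the national sample is below grade level by construction, so a school with only 40 per cent below that point compares favorably with the norm group.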

Individual Pupil Interpretation 

This brochure is not directly concerned with all of the
intricacies of test interpretation. It is primarily concerned with
test norms. But, since one of the purposes of test norms is to
develop and communicate meaning and understandings about
individual pupils, test interpretation cannot be ignored. Our
purpose here will be simply to point up those characteristics or
aspects of test norms themselves that must be kept in mind, and
not to cover test interpretation in any great depth.

There are two general ways of using norms in test
interpretation. The first is to use norms as a yardstick for comparing a
pupil with himself, for comparing his own high points with his
own low points, or for comparing his performance over time. If 
this is the chief point of interest, then the norms help us see in 
which areas a pupil scores high and in which he scores low. 

Part I of Figure 2 shows the profiles of two girls, Pam and
Sally, who took an aptitude test having four parts: verbal,
numerical, abstract, and mechanical. The letters A, B, C, and D are
neither raw scores nor norms. They simply designate high and
low performance for each pupil in relation to her own
performance on these tests. Thus, both girls got their highest score on
verbal; both got their lowest score on numerical (exactly three
units below verbal); both had abstract scores one unit below
verbal (and two above numerical); and both had mechanical
scores two units below verbal (and one above numerical). Thus,
with respect to their own strengths and weaknesses on the four
aptitudes being measured, Pam and Sally are identical. If a
counselor were discussing Pam's or Sally's profile with her, the
counselor and counselee might come up with very similar
interpretations in both situations: greatest strength in verbal
ability, least in numerical, with abstract and mechanical in
between. Perhaps that is sufficient, perhaps not. The whole
point of this example is to illustrate the extreme situation of
evaluating strengths and weaknesses in relation only to the
individual himself.

Figure 2
Pupil Profiles, Part I

[Chart: profile of Pam and Sally across Verbal Reasoning,
Numerical Reasoning, Abstract Reasoning, and Mechanical Reasoning]



The other way of using norms in individual pupil interpreta- 
tion is to use them as a yardstick for comparing a pupil with 
other pupils similar to himself. The question here is not so 
much where one is high or low but whether one is high or 
average or low in relation to other pupils. The primary point 
concerns the relation of the individual to the group, rather than 
the individual to himself.

Figure 2
Pupil Profiles, Part II

Part II of Figure 2 shows that Pam's scores on this aptitude test
were all high (from stanine 6 to stanine 9) and Sally's were all
low (from stanine 1 to stanine 4). Thus, in relation to a national
sample of all pupils of the same age and grade level, Pam scored
very high on verbal reasoning, definitely above average on
abstract and mechanical, and high-average on numerical; Sally
scored low-average on verbal, definitely below average on
abstract and mechanical, and very low on numerical. This
interpretation says very different things about Pam and Sally, whereas
the interpretation in Part I said the same thing for both.

Both interpretations are correct. They differ simply because 
the yardsticks used were different. Pam and Sally have per- 
formed only once on this test. 

These two extreme examples were drawn in order to point 
up individual test interpretation in relation to self and in rela- 
tion to peers. Of course, there is no reason why a counselor or
teacher or administrator cannot follow both courses, looking
both for individual strengths and weaknesses and for the general 
ability or achievement level in relation to other pupils. 
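
The two ways of reading a profile can be put side by side in a short sketch. Pam's stanines (9, 6, 8, 7) are those given in the text; Sally's (4, 1, 3, 2) are inferred from the description of her profile and should be read as an assumption. The three-way labeling of stanines (1 to 3 below average, 4 to 6 average, 7 to 9 above average) is a common convention, also assumed here.

```python
AREAS = ["verbal", "numerical", "abstract", "mechanical"]

pam = {"verbal": 9, "numerical": 6, "abstract": 8, "mechanical": 7}
# Sally's stanines are inferred from the text's description (assumption).
sally = {"verbal": 4, "numerical": 1, "abstract": 3, "mechanical": 2}

def shape(profile):
    """Part I view: each area relative to the pupil's own highest score."""
    top = max(profile.values())
    return {area: profile[area] - top for area in AREAS}

def level(stanine):
    """Part II view: standing against the norm group (assumed convention)."""
    if stanine >= 7:
        return "above average"
    if stanine <= 3:
        return "below average"
    return "average"

print(shape(pam) == shape(sally))  # same pattern of strengths and weaknesses
for area in AREAS:
    print(area, level(pam[area]), "/", level(sally[area]))
```

The first view treats the two girls as identical; the second separates them sharply. Neither view is wrong, since each answers a different question.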




The Case for Multiple Norms 

In the discussion of local norms the point was made that there
is no reason why a school should use only local or only national
norms or only any other kind of norm. Each different kind of
norm that is used has the potential of adding some bit of
meaning to a total evaluation.

Let's look again at Pam's scores on the four-part aptitude test
that we explored in the previous section (see Figure 3). Pam
had stanines of 9, 6, 8, and 7 on the national norms. Let's assume
that the director of testing in her school system also had
developed local norms to judge the competition within the school.
On the local norms her stanine scores were 8, 4, 7, and 7. In
general, then, her derived scores were a bit lower on local norms,
which says that the school Pam attends has above average pupils
on the aptitudes being measured, with the possible exception of
the area of Mechanical Reasoning. Also, suppose that Pam's
parents had attended State College and that she and they are
interested in how her abilities compare with students attending
State. Fortunately Pam's counselor has been able to obtain
information from State with their own norms for this aptitude
test. From that information he can tell Pam that her stanine
scores in relation to State freshmen are 7, 4, 5, and 5.

There is value to be gained from each one of the different sets 
of norms. A very strong case can be made for the use of multiple
norms: for forcing pupils and parents as well as counselors and
teachers to look at test results from two or three or four different
points of view rather than letting them see only a single
percentile, a single stanine, or a single grade placement score.
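
Pam's three sets of stanines can be laid out together in a brief sketch. The stanines are those quoted in the text; the order of the four areas (verbal, numerical, abstract, mechanical) is assumed to follow the earlier profile, and the helper name is hypothetical.

```python
AREAS = ["verbal", "numerical", "abstract", "mechanical"]

# Pam's stanines on three different norm groups, as given in the text.
pam_stanines = {
    "national": [9, 6, 8, 7],
    "local": [8, 4, 7, 7],
    "State freshmen": [7, 4, 5, 5],
}

def shift_from_national(norm_name):
    """Stanine change in each area when moving from national norms."""
    return [score - base for score, base
            in zip(pam_stanines[norm_name], pam_stanines["national"])]

for norm_name in ("local", "State freshmen"):
    print(norm_name, dict(zip(AREAS, shift_from_national(norm_name))))
```

On local norms Pam drops in three areas and holds steady in mechanical reasoning; against State freshmen she drops in all four. Each comparison group tells her something the others do not.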

Group Interpretation

The general principles of individual test interpretation apply
also to group interpretation. One can analyze group results
obtained within a single school system, and at the same time
compare group results from one system to national or regional
or state results.

When one speaks of comparing group results, one generally
is talking about comparing one aspect of group results (one
point in the distribution). The most common thing is to
compare medians or means. Sometimes other points, such as the
quartile points (25th and 75th percentiles) and the highest and
lowest scores also are compared. In any event a single point or
a limited number of points in each distribution are compared.

Suppose that the 11th graders in the high school of the Horace
Mann School System produce the results in Figure 4. Their
average (median) percentiles are 60 in English, 50 in
mathematics, 55 in social studies, and 45 in science. The first type of
comparison causes one to see that their highest score is in
English, the next is in social studies, the next in mathematics, and
the lowest in science. In this type of comparison one is
concerned only with rank order of performance within the single
school system. Having established the rank order, one begins
to question why it came out as it did. Does it simply represent
chance, with differences so small as to be insignificant? Or are
the differences real? If so, why do Horace Mann pupils do their
best work in English and poorest in science? One can determine
statistically whether the differences are significant or not. If
they are, one then must rely upon the Horace Mann principal
and teachers and curriculum director to "explain" the
differences. In this example the same rank order would have appeared
if the percentiles had been 40 and 30 and 35 and 25 respectively
(see Figure 4). The rank order is the crucial point.

Figure 4
A Group Profile

In the second type of comparison one is concerned with the
arithmetic size of the medians. Thus, one could say that, when
compared to a national sample, Horace Mann pupils are average
or above in English, mathematics, and social studies and below
average in science. Or, it might be more reasonable to think of
average as a range, not a point, and say that Horace Mann pupils
were about average in mathematics, social studies, and science
and were above average in English. In this comparison, a group
comparison, a smaller variation from the median may be
considered significant than when working with individual results.

When using norms for group comparisons there are
several points to keep in mind. One of these is that in comparing
achievement test results with national norms it is important to
consider possible group differences in scholastic aptitude or
general intelligence. In our Horace Mann example we saw that
the school appeared average in three areas and above average
in one. This statement assumes that the pupils in Horace Mann
are of average ability. If they were of above average ability, their
group achievement would not look so good; if they were of below
average ability, their group achievement would appear very fine
indeed. Average ability level cannot be ignored when
comparing group results with an external norm (national, state).
Average ability is of less importance if one is making high and low
comparisons within one school.

Another very important point in the area of group evaluation
is that the total group or a random sample must be tested if
group comparisons are to be made. It is not reasonable to test
only part of a group and then draw inferences that apply to the
total group. The most obvious deviation from this principle in
recent years has been in relation to inferences some people have
drawn from results on the National Merit Scholarship
Qualifying Test (NMSQT). The NMSQT is a test of general
educational development, a type well suited to general curriculum
evaluation. The authors have developed national norms for the
test by relating it to another widely used test. Thus, it is possible
to use the NMSQT for group evaluation if a school tests an
entire grade or a representative group with it. However, most
schools use a self-selection process, with many college-bound
pupils and a few others taking the NMSQT. Such a sample is
not at all typical of the total grade, and the group results from
such a sample are almost meaningless. There is no known
reference group against which one can compare results. Even the
norms built only on those pupils taking the NMSQT have very
little meaning, because of the impossibility of defining the
population exactly. Some schools test only a handful of pupils with
the NMSQT, some a sizable group, and some everyone. The
mixture makes comparisons impossible.
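
A small sketch shows why results from a self-selected group say little about the whole grade. All of the scores are invented for illustration; the "volunteers" are assumed to be mostly the stronger, college-bound pupils plus two others, in the spirit of the situation described above.

```python
from statistics import mean

# Hypothetical scores for every pupil in one grade (invented).
grade_scores = [35, 38, 41, 44, 46, 48, 50, 52, 54, 56,
                58, 60, 62, 65, 68, 71, 74, 77, 80, 85]

# Self-selected test takers: two other pupils plus the eight
# highest scorers, standing in for the college-bound group.
volunteers = [44, 52] + grade_scores[12:]

grade_mean = mean(grade_scores)
volunteer_mean = mean(volunteers)
print(grade_mean, volunteer_mean)
```

The volunteers' average runs well above the average for the whole grade, so a summary based on them alone would flatter the school. A different school with a different mix of volunteers would be flattered by a different amount, which is why such results cannot be compared.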

Another point worth mentioning in relation to group norms
is the use of summary statistics on group results for publicity
purposes. Some systems take pride in letting their communities
know that group test results are well above national norms.
There is certainly nothing wrong in being proud of a job well
done, providing the evidence actually says that the job has been
well done. But both of the points above should be kept in mind
in this connection. Achievement test results at the 70th
percentile in English and mathematics and social studies and science
are not outstanding if the average IQ also is at the 70th
percentile level. Having ten National Merit semifinalists in a large
high school with an average IQ of 120 may not be any more
laudatory than having one semifinalist in a high school with an
average IQ of 90. Again, it is fine to be proud of outstanding
group achievement and outstanding individual achievement, but
it is unrealistic for a school to assume all of the credit for such
achievement.


Special Considerations 

In interpreting test norms there are a few more points to be
made. They do not apply just to individual use or just to group
use. They are not considerations that are systematic and regular;
rather they vary from situation to situation. The first of these
is not really a characteristic of norms at all; it is a characteristic
of our society and particularly of parents in our society; it is a
feeling. It is the feeling that somehow being "below average" is
a stigma or a curse; that in fact being "average" in our society is
not enough. Many parents hope, and really expect, that their
children will be above average. When a parent is faced with
average or below average grades or test scores for his youngster,
there often is a feeling of resentment, a feeling that the school
has somehow let him down. The parent is apt to feel that he
has provided the school with an above average boy or girl, and
that any failure to maintain that position is bound to be the fault
of the school. This tendency on the part of so many adults has
important implications for test interpretation.

To satisfy this feeling, what is needed, perhaps, is a
"psychological norm" that is low enough so that almost all pupils find
themselves above that norm. To be more realistic, what may be
needed is a series of objective standards of minimum
achievement for various grade levels and/or subject areas that
reasonably can be met by most pupils. Until such standards are
developed we are left with the situation in which norms, by
definition, say that half of all pupils fall at or below the midpoint.

Another consideration for a test user to keep in mind is clearly
and specifically a test characteristic. The title of a test does not
define the content of the items in that test. Two reading tests
designed for use in grade 7 may or may not measure the same
reading skills. If one examines several of the widely used
reading tests found within junior high achievement batteries, he will
find one reading test consisting entirely of test items which
relate to short selections that are read while the examinee is
taking the test. He will find another reading test that includes
some items of that type but also includes items of a study-skills
nature within the reading test. Reading scores from two tests
that differ systematically in the types of items they use represent
performance on two different tasks. They should not be labeled
identically nor interpreted identically.

It was mentioned in an earlier section that, despite the work
of authors and publishers in developing national norms for a
test, some variation creeps into the norms from sampling
inadequacies. There are several types of systematic differences that
may occur. One of these is the variation that may occur in mean
(or median) values from one set of norms to another. It might
seem that such differences should not appear, since, by definition,
the middle score for any set of norms is set at the median
position. The reason variations occur is that no two samples, even
from the same defined population, will be identical.

In Figure 5 let's assume that by some magic means we know
the exact national distribution of all 10th graders on their
knowledge in general science. Testmaker I develops Test I
for general science. Testmaker II develops Test II for general
science. Both are good tests, but Test II has a slightly larger
percentage of physics-type items, while Test I has a slightly
larger percentage of biology-type items. Test I is normed on
10,000 pupils in 20 states representing all geographic areas and
sizes of schools. Test II is normed on 20,000 pupils in 30 states
representing all geographic areas and sizes of schools. None of
the pupils tested with Test I was tested also with Test II in the
norming process.

Now assume that 10th graders find biology-type items
somewhat easier to do than physics-type items. Also, assume that the
10,000 pupils tested with Test I actually are slightly above
average in general science achievement. Putting these two
assumptions together we find that the median for the sample of 10,000
is a bit above the "true" median for all 10th graders on all types
of general science items. This is pictured as Sample I in Figure 5.

Again, assume that 10th graders find physics-type items a bit 
more difficult than biology-type items. Also, assume that the 
20,000 pupils who took Test II actually are slightly below
average in general science achievement. Putting these two
assumptions together we find that the median for the sample of 20,000
is a bit below the "true" median for all 10th graders on all types 
of general science items. This is pictured in Sample II in 
Figure 5. 

These conditions lead one to two sets of norms for the two
different tests of general science though they are supposedly
measuring the same achievement characteristic. Theoretically,
if both tests sampled exactly the same skills and both normative
samples were identical, the two sets of norms would be identical.
In practice they will vary a bit, as in our example. To extend
our example to the area of individual interpretation, one can
imagine a pupil who is exactly average on our hypothetical
[...] Such a person would get a percentile rank of [...]
Test I, and of perhaps 53 or 54 on Test II,
depending on how much each sample varied from the "true"
situation. In like fashion a "true" percentile of 75 might be a
70 on Test I and 80 on Test II. ("True" is defined here as the
exact knowledge or ability that an individual possesses; it
cannot be measured.)
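
The effect of two slightly different norming samples can be sketched numerically. Everything in the sketch is an assumption made for illustration: achievement is treated as normally distributed, Sample I is placed one-tenth of a standard deviation above the "true" national distribution, and Sample II one-tenth below. The size of the shift is invented; only its direction comes from the example above.

```python
from statistics import NormalDist

# Assumed "true" national distribution, in standard-deviation units.
true_population = NormalDist(mu=0.0, sigma=1.0)

# Assumed norming samples: Sample I runs a little high, Sample II a little low.
sample_i = NormalDist(mu=0.1, sigma=1.0)
sample_ii = NormalDist(mu=-0.1, sigma=1.0)

def percentile_on(norms, true_percentile):
    """Percentile a pupil would receive on the given norms."""
    ability = true_population.inv_cdf(true_percentile / 100.0)
    return 100.0 * norms.cdf(ability)

for true_percentile in (50, 75):
    on_test_i = percentile_on(sample_i, true_percentile)
    on_test_ii = percentile_on(sample_ii, true_percentile)
    print(true_percentile, round(on_test_i), round(on_test_ii))
```

Under these assumed shifts a truly average pupil comes out near the 46th percentile on the Test I norms and near the 54th on the Test II norms. The same pupil, with the same ability, is placed differently by each yardstick.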

Very few formal studies have been made of mean differences 
in test norms. Test publishers are not in a position to make them, 
and few other agencies have the resources. The studies that have 
been made generally are within a single city or state. Certain state 
testing programs lend themselves to such studies. Since tests are 
revised and restandardized periodically and each
restandardization involves a different sample, there is need for continuing
research on the comparability of norms for widely used
standardized tests.

In Conclusion



Almost inevitably a discussion like this one focuses on the
problems in an area, and in the very process of explaining and
trying to simplify things, it may create the impression that those
things are difficult and complicated. To be sure, there are some
complexities in the sensible use and interpretation of test norms,
but they are rather modest ones and a person can learn to handle
them with ease.

Anyway, even if there are some problems, they are certainly 
worth wrestling with, because the potential gains are so great. 
The scholarly, scientific work of the testmakers and the com- 
panies engaged in testing has built up a tremendous, un- 
precedented body of resources for American education. It en- 
ables us, as never before, to diagnose difficulties, make thought- 
ful predictions, and evaluate the successes or failures that grow 
out of our work. 

Surely it is worth real effort to capitalize on the possibilities of 
such resources. Yet most of their value can be thrown away— in 
fact, they can lead to damage— if we are not competent and wise 
in interpreting and using the results. Therefore, we cannot 
resist concluding with some generalized remarks which go some- 
what beyond the technical scope of this booklet. 

Despite the fact that tests are growing more accurate and pre- 
cise, year by year, the wisest educators still use their results with 
a certain moderation. In view of all the possibilities for varia- 
tion, it is a little unreasonable to act as if a youngster's recorded 
IQ, his score on a personality inventory, his percentile on an 
aptitude test, or even his stanine placement on an achievement 
test represents exactly his true ability, his real characteristics 
and aptitudes, or his actual accomplishment. The state of his 
motivation, his health, his happiness, may have driven him un- 
usually low or high on a given day. Maybe on another day— or 
with another tester— he might have scored quite differently. The 
test may not even have fit him, or the curriculum of his school, 
very well. And then, with the small sampling of knowledge in a 
given test, sheer luck may have run for him or against him. 
Especially near the middle of the distribution a few items can 
make a striking difference in percentiles. (Look back to Table 
1, for instance; the difference between missing or solving two 
items is the difference between the 40th and 60th percentiles.) 
On the whole, it is better to think in rather broader terms: 
"middle range," "high average," etc. The makers of one famous 
personality inventory specify that no T-score between 40 and 60 
is to be thought of as "different" from the mean. 
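
A rough sketch in Python shows why a couple of items matter so much near the middle and so little at the extremes. It does not reproduce Table 1. It assumes, purely for illustration, that raw scores in the norm group are normally distributed with a mean of 25 items and a standard deviation of 4. The band labels and the function names are likewise assumptions; only the idea of treating T-scores from 40 to 60 as not "different" from the mean comes from the text.

```python
from statistics import NormalDist

# Hypothetical norm group: mean of 25 items correct, standard deviation of 4.
MEAN, SD = 25.0, 4.0
norms = NormalDist(MEAN, SD)


def percentile(raw_score):
    """Percent of the assumed normal norm group scoring below `raw_score`."""
    return 100.0 * norms.cdf(raw_score)


def t_score(raw_score):
    """Standard score rescaled to a mean of 50 and a standard deviation of 10."""
    return 50.0 + 10.0 * (raw_score - MEAN) / SD


def broad_band(raw_score):
    """Coarse label: T-scores from 40 through 60 count as the middle range."""
    t = t_score(raw_score)
    if t < 40.0:
        return "below average"
    if t > 60.0:
        return "above average"
    return "middle range"


# Two items apart near the middle versus two items apart near the top.
middle_gap = percentile(26) - percentile(24)
upper_gap = percentile(35) - percentile(33)

print(round(percentile(24)), round(percentile(26)), broad_band(24), broad_band(26))
print(round(middle_gap, 1), round(upper_gap, 1))
```

Under these assumptions, raw scores of 24 and 26 fall at roughly the 40th and 60th percentiles, yet both carry the same broad label, while the same two-item gap near the top of the scale moves the percentile by less than two points.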

Furthermore, valuable as test data are, they are only one 
form of evidence. After teachers and counselors and adminis- 
trators have lived and worked with a child for years— after their 
intuitions and judgments about him have slowly coalesced— and 
after he has accumulated a record of successes and failures, as 
represented by grades— it would be sheer folly to write off all 
such evidence in favor of one or a few scores on paper-and-pencil 
tests, no matter how good those tests are. We have not yet 
developed any formal procedures to replace the judgments of 
teachers and counselors and administrators; we do have formal 
procedures to provide evidence that will help to improve those 
judgments. 

None of this is to downgrade the enormous importance of good 
testing programs. Used together with all the other evidence we 
can get, test data are a marvelous aid to better planning and 
teaching. To us in the profession the great challenge is to use 
this new resource perceptively on behalf of every child and 
youth in our care. 